Let’s play together: Collaborative Data Science
Data Science Conference 4.0
Mario Annau
September 19, 2018
Why is it so hard?
- Data Science is an interdisciplinary field.
- Most scientists care more about methods than code.
- Most engineers care more about code than methods.
- Psychological barriers keep people from collaborating.
Why is it so important?
- Review of models and code improves overall quality.
- Collaboration can generate new ideas.
- Network effects arise when more people work together efficiently.
Improving Network Effects
- How can code be managed to have positive network effects?
- How can teams communicate and collaborate efficiently?
Case study: The CRAN package repository
Package Redundancy
- Lack of communication between authors can lead to redundant packages.
- Redundancy is not helpful for infrastructure packages.
- Example: R-Excel Package
- Example: HDF5 package development
HDF5 packages
- Store large amounts of data, e.g. tick data
- Unsatisfied with rhdf5, hdf5, h5r, … → h5

2 years ago …
- Presentation of h5 at R/Finance 2016
- Rcpp to interface HDF5 C++ API
- Basic HDF5 features implemented
… 2 months later …
On June 21, 2016 Holger wrote:
… my name is Holger Hoefling, I have developed a new version of a wrapper library for hdf5 (R6 Classes, almost all function calls wrapped, full support for all datatypes including tables etc) …
And I replied:
On June 21, 2016 Mario wrote:
sounds interesting!
What’s different in hdf5r?
- Automatic code generation against HDF5 C API
- Usage of R6 (instead of S4) classes
- Close connections during garbage collection
- Broad coverage of low-level library features
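Closing connections during garbage collection is a general finalizer pattern. The sketch below illustrates the idea in Python rather than R and is not hdf5r's actual implementation; the `Handle` class, the `closed_log` list, and the file name `ticks.h5` are hypothetical names chosen for illustration.

```python
import gc
import weakref


class Handle:
    """Hypothetical stand-in for a wrapper around a native file handle."""

    def __init__(self, name, closed_log):
        self.name = name
        # Register a cleanup callback that runs when this object is collected.
        # The callback must not reference `self`, or the object would stay alive.
        self._finalizer = weakref.finalize(self, closed_log.append, name)

    def close(self):
        # Explicit close; a finalizer runs at most once, so this is idempotent.
        self._finalizer()

    @property
    def closed(self):
        return not self._finalizer.alive


closed_log = []
h = Handle("ticks.h5", closed_log)
del h         # drop the last reference to the handle
gc.collect()  # force a collection in case the runtime defers cleanup
```

With this pattern a user who forgets to call `close()` does not leak the connection, while an explicit `close()` remains available for deterministic cleanup.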
Merging codebases
- Maintain high-level interface and test cases from h5
- Get low-level HDF5 support within R

On Oct 10, 2016 Holger wrote:
thanks - merged!
The Joys of Collaboration
(after overcoming psychological barriers)
- Code reviews
- Higher-quality code
- End product of higher quality than the separate packages
Q: How can code be managed to have positive network effects?
- Put it into a reusable package.
- Continuous code reviews and tests.
- Use a transparent platform where the code can be inspected.
Q: How can teams communicate and collaborate efficiently?
- Have the right tools and mindset in place.
- Incentivise collaborative efforts.
- Accept unexpected hypotheses and failures.
- Open-mindedness.